Finite Sample Approximation Results for Principal Component Analysis: a Matrix Perturbation Approach
نویسندگان
چکیده
Principal Component Analysis (PCA) is a standard tool for dimensional reduction of a set of n observations (samples), each with p variables. In this paper, using a matrix perturbation approach, we study the non-asymptotic relation between the eigenvalues and eigenvectors of PCA computed on a finite sample of size n, to those of the limiting population PCA as n → ∞. As in machine learning, we present a finite sample theorem which holds with high probability for the closeness between the leading eigenvalue and eigenvector of sample PCA and population PCA under a spiked covariance model. In addition, we also consider the relation between finite sample PCA and the asymptotic results in the joint limit p, n →∞, with p/n = c. We present a matrix perturbation view of the “phase transition phenomenon”, and a simple linear-algebra based derivation of the eigenvalue and eigenvector overlap in this asymptotic limit. Moreover, our analysis also applies for finite p, n where we show that although there is no sharp phase transition as in the infinite case, either as a function of noise level or as a function of sample size n, the eigenvector of sample PCA may exhibit a sharp ”loss of tracking”, suddenly losing its relation to the (true) eigenvector of the population PCA matrix. This occurs due to a crossover between the eigenvalue due to the signal and the largest eigenvalue due to noise, whose eigenvector points in a random direction.
منابع مشابه
Finite Sample Approximation Results for Principal Component Analysis: a Matrix Perturbation Approach1 by Boaz Nadler
Principal component analysis (PCA) is a standard tool for dimensional reduction of a set of n observations (samples), each with p variables. In this paper, using a matrix perturbation approach, we study the nonasymptotic relation between the eigenvalues and eigenvectors of PCA computed on a finite sample of size n, and those of the limiting population PCA as n→∞. As in machine learning, we pres...
متن کاملCombined Unfolded Principal Component Analysis and Artificial Neural Network for Determination of Ibuprofen in Human Serum by Three-Dimensional Excitation–Emission Matrix Fluorescence Spectroscopy
This study describes a simple and rapid approach of monitoring ibuprofen (IBP). Unfolded principal component analysis-artificial neural network (UPCA-ANN) and excitation-emission spectra resulted from spectrofluorimetry method were combined to develop new model in the determination of IBF in human serum samples. Fluorescence landscapes with excitation wavelengths from 235 to 265 nm and emission...
متن کاملCombined Unfolded Principal Component Analysis and Artificial Neural Network for Determination of Ibuprofen in Human Serum by Three-Dimensional Excitation–Emission Matrix Fluorescence Spectroscopy
This study describes a simple and rapid approach of monitoring ibuprofen (IBP). Unfolded principal component analysis-artificial neural network (UPCA-ANN) and excitation-emission spectra resulted from spectrofluorimetry method were combined to develop new model in the determination of IBF in human serum samples. Fluorescence landscapes with excitation wavelengths from 235 to 265 nm and emission...
متن کاملAccurate Error Bounds for the Eigenvalues of the Kernel Matrix
The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property th...
متن کاملSmooth Sensitivity Based Approach for Differentially Private Principal Component Analysis
We consider the challenge of differentially private PCA. Currently known methods for this task either employ the computationally intensive exponential mechanism or require an access to the covariance matrix, and therefore fail to utilize potential sparsity of the data. The problem of designing simpler and more efficient methods for this task has been raised as an open problem in [19]. In this p...
متن کامل